
FIGURE 3.27
Effect of hyperparameters λ and τ on one- and two-stage training using 1-bit ResNet-18: (a) one-stage; (b) two-stage.

which is termed the ApproxSign function and is used to compute the backpropagation gradient of the activation. Compared with the traditional STE, ApproxSign has a shape similar to that of the original binarization function sign, so the activation gradient error can be controlled to some extent. Similarly, CBCN [149] applies an approximate function to address the gradient mismatch caused by the sign function. MetaQuant [38] introduces meta-learning to learn the gradient error of the weights using a neural network. IR-Net [196] includes a self-adaptive Error Decay Estimator (EDE) to reduce the gradient error during training, which considers the different requirements at different stages of the training process and balances the parameters' update ability against the reduction of the gradient error. RBNN [140] proposes a training-aware approximation of the sign function for gradient backpropagation.

In summary, prior art focuses on approximating the gradient derived from $\partial b_a/\partial a_{i,j}$ or $\partial b_w/\partial w_{i,j}$.
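To make this concrete, the following PyTorch-style sketch shows a sign binarizer whose backward pass replaces the zero-almost-everywhere derivative of sign with a piecewise-linear surrogate, in the spirit of ApproxSign. It is a minimal sketch: the class and function names are ours, and the exact surrogate shape differs between the methods surveyed above.

```python
import torch

class ApproxSignBinarize(torch.autograd.Function):
    """Sign binarization with a surrogate gradient (illustrative sketch only)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Piecewise-linear surrogate of d sign(x)/dx: 2 - 2|x| on [-1, 1], 0 outside.
        surrogate = (2.0 - 2.0 * x.abs()).clamp(min=0.0)
        return grad_output * surrogate


# Usage: binarize activations while keeping a usable gradient for backpropagation.
a = torch.randn(4, 8, requires_grad=True)
b_a = ApproxSignBinarize.apply(a)
```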

Unlike these approaches, ours focuses on a different perspective of gradient approximation, i.e., the gradient from $\partial G/\partial w_{i,j}$. Our goal is to decouple A and w to improve the gradient calculation of w. RBONN manipulates w's gradient through its bilinearly coupled variable A ($\partial G(A)/\partial w_{i,j}$). More specifically, our RBONN can be combined with the prior art by comprehensively considering $\partial L_S/\partial a_{i,j}$, $\partial L_S/\partial w_{i,j}$, and $\partial G/\partial w_{i,j}$ in the backpropagation process.
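As a rough illustration of this combined view, the sketch below computes the task-loss gradient $\partial L_S/\partial w$ and the gradient $\partial G/\partial w$ of a coupling term with respect to the latent weights, then mixes them with the weight λ. The specific form of G used here (a simple quadratic coupling between w and the scaled binary weights A b_w), the toy shapes, and all variable names are assumptions made purely for illustration; the recurrent update of A and the backtracking controlled by τ in RBONN are not reproduced.

```python
import torch
import torch.nn.functional as F

# Toy shapes; all names and the form of G below are illustrative assumptions.
out_ch, in_feats, batch = 8, 16, 4
w = torch.randn(out_ch, in_feats, requires_grad=True)  # latent real-valued weights
A = torch.diag(torch.rand(out_ch))                      # channel-wise scaling (the coupled variable)
x = torch.randn(batch, in_feats)
target = torch.randn(batch, out_ch)

# Straight-through binarization so the task loss stays differentiable in w.
b_w = w + (torch.sign(w) - w).detach()

# Task loss L_S computed with the scaled binary weights A b_w.
loss_s = F.mse_loss(x @ (A @ b_w).t(), target)

# A generic quadratic coupling term G(A, w), standing in for the bilinear term.
G = 0.5 * (w - A @ b_w).pow(2).sum()

lam = 1e4  # lambda weighting the coupling term (the best value reported in Fig. 3.27)
grad_ls_w, = torch.autograd.grad(loss_s, w, retain_graph=True)
grad_g_w, = torch.autograd.grad(G, w)

# Combined weight gradient: dL_S/dw + lambda * dG/dw.
w_grad = grad_ls_w + lam * grad_g_w
```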

3.8.4 Ablation Study

Hyperparameters λ and τ. The most important hyperparameters of RBONN are λ and τ, which control the proportion of $L_R$ and the backtracking threshold in the recurrent bilinear optimization, respectively.

optimization. On ImageNet for 1-bit ResNet-18, the effect of hyperparameters λ and τ is

evaluated under one- and two-stage training. The performance of RBONN is demonstrated

in Fig. 3.27, where λ ranges from 1e3 to 1e5 and τ ranges from 1 to 0.1. As observed, with

λ reducing, performance improves at first before plummeting. The same trend emerges when

we increase τ in both implementations. As demonstrated in Fig. 3.27, when λ is set to 1e4

and τ is set to 0.6, 1-bit ResNet-18 generated by our RBONN gets the best performance. As